SPECIFICATION 



Electronic Version 1.2.8
Stylesheet Version 1.0

Method For Detecting Current Client-Side Browser Encoding

Background of Invention 

[0001] The World Wide Web is used by millions of people around the world, in many different
languages. The TCP/IP and HTTP protocols transmit data between server and client, in most cases
without exact knowledge of the language and encoding that the client-side user uses. While
Unicode covers all known languages and characters, its encodings, UTF-8 and UTF-16, are very
rarely used as a standard for information exchange. Instead, some languages use several different
encodings. For instance, there are two widely used Russian encodings, and two more that are less
widely used. Many languages have one encoding for the Windows operating system and another for DOS.
Linux and Unix often use yet another encoding; e.g. in Japanese, Shift JIS is widely (but not
always) used on Windows, and EUC-JP is widely (but not always) used on Linux and Unix.

[0002] Ordinary users around the world do not know, and often do not care, what encoding they use.
This can be a problem when the user downloads a page in a different encoding, but that case is
solved by specifying the page encoding inside the HTML. When a user sends a form to the server,
though, the server cannot find out the client-side encoding, and can either guess or keep the
data as received, in whatever encoding it was.

[0003] This makes searches in international databases almost impossible: for instance, the same set
of codes can correspond to different characters in different languages. It also makes it
impossible to store data in the server databases in an encoding-independent way (which basically
means in Unicode).

[0004] Some web sites solve this problem by having different pages for different languages. This is
only a partial solution for languages that have several encodings, and since users, as
experience shows, do not know their encoding, the data they supply cannot always be parsed
correctly.

[0005] Another solution is to retrieve from the HTTP request header the encodings that are enabled
on the client side. This gives only a hint as to which languages may be installed on the user's
computer. In some cases that is enough, namely when there is one language that has one encoding;
in other cases it is not, for instance when a computer is used for Japanese-Ukrainian
translation. In this latter case the computer will have at least two languages installed, each
with three different encodings, so we have to choose between 7 encodings (counting English).

[0006] If browsers made the current encoding available in a JavaScript object on the web page, or to
the server in the HTTP request, this would be a solution, but unfortunately this is not so: browsers
do not provide this information.

[0007] 

Summary of Invention

[0009] The present invention solves the problem of browser encoding detection. The result of
detection can be used in a JavaScript program or in a Java applet to adapt the contents depending
on the encoding. The result can also be passed to the server, either in subsequent HTTP requests
or with the form data. If the form data are accompanied by the encoding name, then the data can
be unambiguously converted into encoding-neutral Unicode strings.

[0010] The method consists of creating an invisible form in the HTML document, with a single hidden
input field that contains Unicode character codes for a sample Unicode string, and matching parts
of the sample Unicode string with characters or sequences of characters in various specific
encodings; when the characters match, the encoding is detected.

Detailed Description 

[0011] The browser encoding is detected by a piece of JavaScript code that is placed at the very top
of the HEAD part of the HTML page, before any body text is written to the document. First, a form
is written to the document, with the hidden input whose value is the sample Unicode string, e.g.:

document.write('<form name=VP_encoding><input name=t type=hidden value="&#1040;&#192;&#260

The JavaScript also contains a function, VP_getEncoding(), that returns the current encoding name.
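
As a hedged illustration of this first step, the sketch below assembles the hidden-form markup from a list of Unicode code points. The helper name buildProbeForm and the exact attribute quoting are assumptions made for this example; only the form name, the input name and the three code points shown come from the text above.

```javascript
// Hypothetical helper: builds the markup for the hidden probe form.
function buildProbeForm(formName, inputName, codePoints) {
  // Numeric character references are resolved by the HTML parser to the same
  // Unicode characters no matter which encoding the page bytes are read in.
  const value = codePoints.map((cp) => "&#" + cp + ";").join("");
  return '<form name="' + formName + '">' +
         '<input name="' + inputName + '" type="hidden" value="' + value + '">' +
         "</form>";
}

// Only the first three code points of the sample are used here.
console.log(buildProbeForm("VP_encoding", "t", [1040, 192, 260]));
```

In a browser, the returned string would be passed to document.write() before any body text is emitted, so that the input's value can be read back by the detection function.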

[0012] The function works like this. First, it splits the sample Unicode string into two samples,
one for multi-byte encodings (the multi-byte sample), another for UTF-8 and single-byte encodings
(the single-byte sample).

[0013] The second step detects UTF-8 encoding by comparing the single-byte sample to the same
string directly encoded using UTF-8. If the two match, the algorithm stops.
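
To make "the same string directly encoded using UTF-8" concrete, here is a minimal sketch that UTF-8 encodes a string and spells each resulting byte as a character with the same numeric value, which is the form the byte samples take in the program listing. The helper names utf8BytesAsChars and toEscapes are assumptions for this example, and TextEncoder is used only as a convenient way to obtain the bytes.

```javascript
// Hypothetical helper: UTF-8 encode a string, then represent each byte as a
// character with the same numeric value (U+0000..U+00FF).
function utf8BytesAsChars(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Hypothetical helper: render a string as \uXXXX escapes so byte values are visible.
function toEscapes(s) {
  return Array.from(s, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")).join("");
}

// First two characters of the single-byte sample: code points 1040 and 192.
const firstTwoSampleChars = String.fromCodePoint(1040, 192);
console.log(toEscapes(utf8BytesAsChars(firstTwoSampleChars)));
// prints \u00d0\u0090\u00c3\u0080
```

A browser that reads the page as UTF-8 turns those bytes back into the original characters, so the byte sample compares equal to the value of the hidden input; under any other encoding the two strings differ.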

[0014] The third step compares the multi-byte sample string to the same string encoded in Big5
Chinese, GBK Chinese, EUC_TW Chinese, EUC_JP Japanese and SJIS Japanese (the list can be easily
extended). Note that the multi-byte sample string is padded with a space character, to make it a
valid sequence of bytes when the encoding is UTF-8.

[0015] The fourth step compares one or two characters of the single-byte sample string to the
characters directly encoded using different single-byte encodings. Note that a character cannot
be stored alone in the string, but instead has to be padded with a space character, to make the
sequence legal in UTF-8 encoding. The set of encoding samples can be easily expanded.

[0016] If the fourth step does not detect the encoding, "?" is returned.
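
The position-wise comparison of the fourth step, with its fallback result, can be sketched roughly as follows. The helper names matchesMarkedPositions and pickEncoding and the simulated candidate list are assumptions for illustration; in a real page the candidate strings would be literals whose bytes the browser has already decoded under its current encoding.

```javascript
// Hypothetical helper: true when every non-space character of `sample`
// equals the character at the same index in `probe`.
function matchesMarkedPositions(probe, sample) {
  let compared = 0;
  for (let i = 0; i < sample.length; i++) {
    if (sample.charAt(i) === " ") continue; // padding: this position is not tested
    if (probe.charAt(i) !== sample.charAt(i)) return false;
    compared++;
  }
  return compared > 0;
}

// Hypothetical driver: `candidates` is a list of [name, decodedSample] pairs,
// tried in order; "?" is returned when nothing matches.
function pickEncoding(probe, candidates) {
  for (const [name, sample] of candidates) {
    if (matchesMarkedPositions(probe, sample)) return name;
  }
  return "?";
}

// Probe characters for code points 1040, 192, 260, 270, 901, 287.
const sampleProbe = String.fromCodePoint(1040, 192, 260, 270, 901, 287);

// Simulated view of two candidate literals in a browser that reads byte 0xC0
// as U+0104 (the Baltic Windows case); both literals contain that byte.
const seenByBrowser = [
  ["Cyrillic Windows", "\u0104 "], // byte at index 0, where the probe has U+0410
  ["Baltic Windows", "  \u0104 "], // byte at index 2, where the probe has U+0104
];
console.log(pickEncoding(sampleProbe, seenByBrowser)); // prints Baltic Windows
```

Because candidates are tried in order, an encoding that must match at two positions should be listed before one that matches at only one of them.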

[0017] The function VP_getEncoding() can then be used by JavaScript later on the page, or in event
handling routines, and the result can be passed back to the server if needed.
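
One way the result might be passed back to the server is as an extra request parameter. The sketch below is only an assumption about how that could look: the helper withEncodingParam, the parameter name client_encoding and the example URL are hypothetical and do not come from the specification.

```javascript
// Hypothetical helper: append the detected encoding name to a URL.
// "client_encoding" is an assumed parameter name, not one from the text.
function withEncodingParam(url, encodingName) {
  const u = new URL(url);
  u.searchParams.set("client_encoding", encodingName);
  return u.toString();
}

console.log(withEncodingParam("http://example.com/search?q=test", "Cyrillic KOI-8"));
```

The same value could equally be copied into a hidden field of the form being submitted, so that it travels with the form data.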

Program Listing Deposit 

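<HTML><HEAD><TITLE>Encoding test</TITLE><META HTTP-EQUIV="Pragma" CONTENT="no-cache"></HEAD>
<BODY>
<%
  // Code points of the single-byte sample string, one per language group.
  int[] det1b = new int[] { 1040, 192, 260, 270, 901, 287 };
  //                        Cyr   West CtrE Balt GR   Turk
  //                        (with prev)

  // Code point of the multi-byte sample string.
  int[] det2b = new int[] { 0x500b };
  //                        dbl/utf
%>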

<form name="__unicode__">
<input name="t1b" type="hidden" value="<%
  // Single-byte sample, written as numeric character references.
  for (int i = 0; i < det1b.length; i++) {
    out.print("&#" + det1b[i] + ";");
  }
%>"></input>
<input name="t2b" type="hidden" value="<%
  // Multi-byte sample; the space before the closing quote is the padding.
  for (int i = 0; i < det2b.length; i++) {
    out.print("&#" + det2b[i] + ";");
  }
%> "></input>
</form>
<hr>

<script language="javascript">
<%
  // Multi-byte sample (U+500B) as the byte sequence of each encoding,
  // one character per byte.
  String[] b2 = new String[] {
    "UTF8",   "\u00e5\u0080\u008b",
    "Big5",   "\u00ad\u00d3",
    "GBK",    "\u0082\u0080",
    "EUC_TW", "\u00d4\u00b6",
    "EUC_JP", "\u00b8\u00c4",
    "SJIS",   "\u008c\u00c2" };

  // Single-byte samples. A space means "position not tested"; each byte sits
  // at the index of the det1b character it should decode to.
  String[] b1 = new String[] {
    "UTF-8",                    "\u00d0\u0090\u00c3\u0080\u00c4\u0084\u00c4\u008e\u00ce\u0085\u00c4\u009f",
    "Central-European Windows", "  \u00a5\u00cf ",
    "Central-European ISO",     "  \u00a1\u00cf ",
    "Baltic ISO",               "  \u00a1 ",
    "Cyrillic DOS",             "\u0080 ",
    "Baltic Windows",           "  \u00c0 ",
    "Cyrillic Windows",         "\u00c0 ",
    "Cyrillic KOI-8",           "\u00e1 ",
    "Cyrillic ISO",             "\u00b0 ",
    "Turkish",                  " \u00c0   \u00f0 ",
    "ISO_8859_1",               " \u00c0 ",
    "Greek ISO",                "    \u00b5 ",
    "Greek Windows",            "    \u00a1 "
  };
%>



function VP_getEncoding() {
  var encoding = "?";
  var test;
  var t1 = document.forms.__unicode__.t1b.value;
  var t2 = document.forms.__unicode__.t2b.value;
<%  // Check for multibyte stuff
  for (int i = 0; i < b2.length; i += 2) { %>
  <%= i > 0 ? "else " : "" %>if (t2 == "<%= b2[i+1] %> ") {
    encoding = "<%= b2[i] %>";
  }<%
  }
  // Check for single-byte stuff
  for (int i = 0; i < b1.length; i += 2) { %>
  if (encoding == "?") {
<%
    String originalSample = b1[i+1];
    String workingSample = "";
    // chosen[k] is the index of the k-th non-space character of the sample.
    int[] chosen = new int[originalSample.length()];
    for (int j = 0; j < originalSample.length(); j++) {
      char c = originalSample.charAt(j);
      if (c != ' ') {
        chosen[workingSample.length()] = j;
        workingSample += c;
      }
    }
    if (workingSample.length() == originalSample.length()) {
      // No padding: compare the whole string.
%>
    if (t1 == "<%= originalSample %>") {
<%
    } else {
      // Padded sample: compare only the marked positions.
%>
    test = "<%= originalSample %> ";
    if (<%
      for (int j = 0; j < workingSample.length(); j++) {
        %><%= j > 0 ? ") && " : "" %>(t1.charAt(<%= chosen[j] %>) == test.charAt(<%= chosen[j] %>)<%
      }%>)) {
<%  } %>
      encoding = "<%= b1[i] %>";
    }
  }
<% } %>
  return encoding;
}

document.write("Encoding is <font color=red><b>" + VP_getEncoding() + "</b></font><br>");

</script> 

</BODY> 

</HTML> 